
Conversation

Contributor

@justaparth justaparth commented Nov 12, 2023

What changes were proposed in this pull request?

JIRA: https://issues.apache.org/jira/browse/SPARK-43427

Protobuf supports unsigned integer types, including uint32 and uint64. When deserializing protobuf values with fields of these types, from_protobuf currently maps them to the following Spark types:

uint32 => IntegerType
uint64 => LongType

IntegerType and LongType are signed integer types, so this can lead to confusing results. If a stored uint32 value is at or above 2^31, or a uint64 value is at or above 2^63, its binary representation has a 1 in the highest bit, which reads as a negative number when interpreted as a signed integer (i.e. it appears to overflow). No information is lost, since IntegerType and LongType hold 32 and 64 bits respectively, but the signed representation can be confusing.

In this PR, we add an option (upcast.unsigned.ints) to allow upcasting unsigned integer types into a larger type that can represent them natively:

uint32 => LongType
uint64 => Decimal(20, 0)

The new behavior is gated behind an option so that nothing changes for existing clients.
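
To illustrate how the option would be used, here is a hedged PySpark sketch (not code from this PR); the message name Test and the descriptor file test.desc are placeholders, while the option key upcast.unsigned.ints is the one added here:

from pyspark.sql import SparkSession
from pyspark.sql.protobuf.functions import from_protobuf

spark = SparkSession.builder.getOrCreate()

# Placeholder input: one row whose "value" column holds a serialized protobuf
# message as binary (an empty payload is a valid proto3 message).
df = spark.createDataFrame([(b"",)], ["value"])

parsed = df.select(
    from_protobuf(
        "value",
        "Test",                      # placeholder message name
        descFilePath="test.desc",    # placeholder descriptor set path
        options={"upcast.unsigned.ints": "true"},  # option added in this PR
    ).alias("parsed")
)

# With the option enabled, uint32 fields are read as LongType and uint64
# fields as Decimal(20, 0), so large unsigned values keep their meaning.
parsed.printSchema()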

Example of current behavior

Consider a protobuf message like:

syntax = "proto3";

message Test {
  uint64 val = 1;
}

If we compile the above and then generate a message with a value for val above 2^63:

import test_pb2

s = test_pb2.Test()
s.val = 9223372036854775809 # 2**63 + 1
serialized = s.SerializeToString()
print(serialized)

This generates the binary representation:

b'\x08\x81\x80\x80\x80\x80\x80\x80\x80\x80\x01'
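
As a side note, a minimal plain-Python sketch (not part of this PR) that decodes the varint by hand shows that the bytes still carry the full unsigned value; only the signed interpretation makes it look negative:

data = b'\x08\x81\x80\x80\x80\x80\x80\x80\x80\x80\x01'

# First byte 0x08 is the field tag: field number 1, wire type 0 (varint).
payload = data[1:]

# Decode the varint: 7 bits per byte, least-significant group first.
value, shift = 0, 0
for b in payload:
    value |= (b & 0x7F) << shift
    if not (b & 0x80):
        break
    shift += 7

print(value)          # 9223372036854775809, i.e. 2**63 + 1
print(value - 2**64)  # -9223372036854775807, the same bits read as a signed 64-bit long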

Then, deserializing this using from_protobuf, we can see that it is represented as a negative number. I did this in a notebook so it's easier to see, but it could be reproduced in a Scala test as well:

[Screenshot: notebook output showing the deserialized val field rendered as a negative number]
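
For reference, a hedged PySpark sketch of how one might reproduce this; the message name Test and the descriptor file test.desc (generated with something like protoc --descriptor_set_out=test.desc test.proto) are my placeholders, not part of this PR:

from pyspark.sql import SparkSession
from pyspark.sql.protobuf.functions import from_protobuf

spark = SparkSession.builder.getOrCreate()

serialized = b'\x08\x81\x80\x80\x80\x80\x80\x80\x80\x80\x01'
df = spark.createDataFrame([(serialized,)], ["value"])

# Default behavior: the uint64 field maps to LongType, so 2**63 + 1 comes
# back as -9223372036854775807 (the same bits read as a signed long).
df.select(
    from_protobuf("value", "Test", descFilePath="test.desc").alias("parsed")
).show(truncate=False)

# With options={"upcast.unsigned.ints": "true"}, the field would instead be
# read as Decimal(20, 0) and show 9223372036854775809.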

Precedent
I believe unsigned integer types in Parquet are deserialized in a similar manner, i.e. placed into a larger type so that the unsigned value fits natively (see https://issues.apache.org/jira/browse/SPARK-34817 and #31921), so an option to get similar behavior here would be useful.

Why are the changes needed?

Improve unsigned integer deserialization behavior.

Does this PR introduce any user-facing change?

Yes, adds a new option.

How was this patch tested?

Unit Testing

Was this patch authored or co-authored using generative AI tooling?

No

@justaparth justaparth changed the title from "[SPARK-43427][Protobuf] spark protobuf: modify serde behavior of unsigned integer types" to "[SPARK-43427][Protobuf] spark protobuf: allow upcasting unsigned integer types" on Nov 12, 2023
@justaparth justaparth force-pushed the parth/43427-add-option-to-expand-unsigned-integers branch from 599b016 to 94861f8 on November 12, 2023
@justaparth justaparth changed the title from "[SPARK-43427][Protobuf] spark protobuf: allow upcasting unsigned integer types" to "[SPARK-43427][PROTOBUF] spark protobuf: allow upcasting unsigned integer types" on Nov 12, 2023
@justaparth justaparth force-pushed the parth/43427-add-option-to-expand-unsigned-integers branch from 94861f8 to 9ba5d2f on November 12, 2023
@justaparth
Contributor Author

@rangadi do you mind taking a look? i've modified this to make it an option so that the default behavior doesn't change

Member

Hm, I wouldn't add this. It would be problematic when Spark has unsigned types. For the same reason, Parquet also doesn't support unsigned physical types for Spark.

Contributor Author

@justaparth justaparth Nov 13, 2023

It would be problematic when Spark has unsigned types. For the same reason, Parquet also doesn't support unsigned physical types for Spark.

hey, i'm not sure if i follow; do you mind explaining what you mean by this?

My goal here is to add an option allowing unsigned 32 and 64 bit integers coming from protobuf to be represented in a type that can contain them without overflow. I mention this in the description, but I actually modeled my code off of how the parquet code today is written, which i believe is doing this same thing by default:

https://github.com/justaparth/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaConverter.scala#L243-L270

https://github.com/justaparth/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetRowConverter.scala#L345-L351

Member

Oh, it was added here in #31921. My memory was old.

Member

I am less certain about the overall Spark context, but for from_protobuf() this looks fine so far (since it is only enabled with an option).

@justaparth
Contributor Author

bump on this @rangadi in case you get a chance 🙏

Member

@HyukjinKwon HyukjinKwon left a comment

Seems at least coherent at high level although I don't know the details in the code

@justaparth
Contributor Author

@rangadi do you mind taking another look here?

@rangadi rangadi left a comment

Overall looks good to me. @HyukjinKwon should we wait for more reviews?


can use protobufOptions.upcastUnsignedInts directly below.
Or add 'option' to this variable's name. It is more readable.

Contributor Author

sure, i've inlined it 👍


Would it be better to add a log here?

Contributor Author

oh like a print statement? im not sure, would that be helpful?


Member

@HyukjinKwon HyukjinKwon left a comment

I think it's fine. LGTM too

@justaparth justaparth force-pushed the parth/43427-add-option-to-expand-unsigned-integers branch from 9ba5d2f to 11ff320 on December 11, 2023
@justaparth justaparth requested a review from rangadi on December 11, 2023
@rangadi rangadi left a comment

LGTM. Thanks @justaparth.
@HyukjinKwon please merge when you get a chance.

@HyukjinKwon
Member

Merged to master.
